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Introduction 


Many diverse techniques have been devised to discover structure within complex 
bodies of data by unsupervised fashion, l.e., cluster analysis (B,.L1, 1965; 
Cormack, 1971; Anderberg, 1973; Duran et al., 197A; Everltt, 197A). The tech- 
niques attempt to group data points, usually In a multidimensional space, Into 
cluster such that all points within a cluster possess Intrinsic similarity 
relatively distinct from the others. 

Hence, application of the techniques to the data often reveals unexpected 
characteristics Inhibited In the data structure. But It has often suffered 
from lack of adequate mathematical description, and either too many suboptl- 
mal solutions, or requirements of astronomical enumerations, in the course of 
searching for the optimal solution. 

The objective of this paper was to present an approach for extraction of 
Gaussian clusters using discrete probability density estimates as the first 
step for an iterative unsupervised classification. The approach is based on 
the presumptJ that there are parts of the feature space in which data popu- 
lations are • y dense, separated by parts of low density. An attempt was 
made to devls. a method suTtable for processing a moderate volume of multi- 
variate measurements, such as satellite multlspectral scanner (MSS) data. 

Parameterization for Clustering Function 

Clustering is often the first step in analyzing a set of data whose character- 
istics have not yet been revealed. It is common to begin with the assumption 
of normal distribution if no knowledge about the data structure is available. 
In multivariate mixture distribution, the normality means multimodal Gaussian 
distribution in multidimensional space. Data surrounding each mode can be 
interpreted as a cluster. A group of data representing a real class may con- 
sist of two or more unimodal clusters and have multimodal distribution. Such 
data are divided into two or more subgroups so that the unimodal distribution 
can be applied to each subgroup. 
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2. Earth Survey Applications ’:^ivision. National Aeronautics and Space Admin- 
istration/Goddard Space Flight Center (NASA/GSFC) Greenbelt, Maryland. 

3. Colorado State University, Fort Collins, Colorado. 

A. Texas A&M University, College Station, Texas. 


1 




Under the assumption of unimodal normality in a cluster, the probability 
density function is given by (Duda and Hart, 1973) 


P^(x) 


^ - |(x-iLi) r 

(2-rr)d/2 |c^|l /2 


(I) 


where 




= a priori probability of cluster i (w^^) 
“ d-by-d covariance matrix of cluster i 
= inverse of the covariance matrix 
= pattern (d-component column) vector 
•= mean vector of patterns of cluster 1 
= tranpose of a matrix 
= dimension of feature space (Integer) 

= base of natural logarithm. 


This is the multivariate normal distribution function for the cluster called 
'w^". The clustering process is carried out by finding all sets of cluster 
parameters: mean vector covariance matrix and a priori probability 

for all the clusters, '^or a given set of N measurements, Y = {x , the 

multivariate mixture probability p(x) may be estimated and then postulated as 
the sum of all the cluster probabilities p^(x) : 



all 1 


( 2 ) 


It may be computed from discrete measurement data by 


P(x) 


= sum of population in a volume element AV (or cell) 
total population (N) 


(3) 


This is the probability density of the mixture of all probable clusters. 
Decomposition of the mixture probability into a set of subgroups having unimodal 
distribution is the task of the clustering process. It is intuitive to divide 
the region into two parts by the boundary where 


Pj^(x) = Pj(x) , 1 / .i . 
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(A) 


Furthermore, It Is likely to declare that x belongs to the cluster 1 (l.e., 
xcwj^) If 


Pl^x) > all J »» i 


(5) 


and, otherwise, x does not belong to the cluster 1 (l.e., x ^ Wj^). This Is 
the maximum J.lkellhood classification or decision rule (Duda et al., 1975). 

The region Is defined by the subspace where the Inequality (Eq. 5) Is satis- 
fied. This Intuition will be exploited to extract a cluster from the data 
set by a clustering function propcsed as 

P(x) 

Gi(x) ■ Jin 


( 6 ) 


Pi(x) 


where iln denotes natural logarithm. 


The usefulness of the clustering function for the maximum likelihood decision 
rule may be seen in the following properties: 


1. Gj^(x) * 0 for the distribution of a single (unmlxed) class, 

2. Gj^(x) > 0 for any mixture distribution of two or more different 
clusters, 

3. For a two-cluster mixture distribution, 

a. G|^(}c) ^ 2 ( 21 ) = on the boundary between the clusters. 


G (xeRj < Jln2 
G,(xVr,) > <'n2 


} 1 - 1 . 2 . 


(7) 

( 8 ) 

(9) 

( 10 ) 


PROOF: 


1. It Is evident by the definition since p(x)=p^(x) for the single class 
data , 

2. P(x) - Pi^2i)* 

all J 


then, 

Gi(x) 


Jin 


P (x) 

Pi(x) 


(q.e.d. ) 


0 . 


3a. On the boundary between the two clusters, 
Pl(x) = P2^2L) = Pi(2L) • 


Pi(x) 


Pi(x) 




That is. 


Gj(x) - G2 (x) - ln2. (q.e.d.) 

3b. Let 1 J . And the maximum likelihood decision rule shows 

Therefore, 

p (xcR.) + p (xeR ) 

G,(xeRj = i?.n ^ ^ J ^ 


( 11 ) 


mixture 

= p . (x^R. ) . 
1 - 1 


> in2. (q.e.d.) 

The last property of the clustering function suggests that a feature space R 
can be divided into two regions: 1) a region belonging to cluster i and 

2) another out of the region, in the case of two-cluster mixture, as such 
jc e w (cluster i) if 

^ G^(x) < J?n2 (12) 

and otherwise, x ^ w^. This criterion will be utilized in this study for the 
extractl< of one-cluster data from the whole set of data. It requires only 
. nowledgv; of a set of the parameters for a single cluster each time. Elements 
of a prospective cluster can be extracted from the set of data without knowing 
characteristics of the other clusters based on this criterion. This fact is 
the beauty of the clustering function G£(jc). 

Hill-Sliding Strategy 

The clustering function, however, cannot be evaluated unless the set of cluster 
characteristic parameters are estimated. The first problem in clustering is 
to find good initial estimates of the parameters employed in most cases. It is 
intuitively viewed that a cluster has a mode, which has the highest probability 


p^(xeR^) 


= In 


1 + 


Pj (x^R^) 
P^(xeR^) 


< £n2 . 

Similarly, for the two-cluster 
Pj(2i^R^) = p^(xeRj) < Pj(>ieRj) 


Gi(x<Ri) = in 


Pj (x^R^) 


1 +■ 


p^(x^R^) 
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density In the cluster. The location of a cluster mode depends on the charac- 
teristics of distribution type or governing law of the distribution, but It Is 
usually observed near the gravitational center (centroid) of the cluster 
(Fig. 1). 

Suppose this presumption Is acceptable In a set of data to be analyzed. Then 
at least one candidate mode which has the highest probability density can be 
picked up. Such a mode Initiates the first clustering by fusing all proba- 
bility cell points that may be categorized Into one cluster. 

A major problem faced In clustering Is that the types of data distribution are 
generally not known in advance; thus each cluster may have a different char- 
acteristic shape in Its distribution. Due to the absence of prior knowledge 
on characteristics of expected clusters, each cluster was Initiated with the 
assumption of isotropic normal distribution, at least in the immediate neigh- 
borhood of a mode candidate. The group of data initially coalesced into a 
cluster reflects the distribution characteristics of the forming cluster In 
some degree since the theoretical shape of its probability contour surface 
maintains near the mode as well as throughout the region of a cluster. Dis- 
tortion of its shape may be observed usually in the regions of Its tails or 
valleys where distributions are affected by neighbor clusters. A simple 
Euclidean distance measure between measurement points can be used In cluster- 
ing data near an apparent centroid without Introducing large trial errors. 

The question is where to terminate the fusing process to avoid picking up 
data points probably originated from different clusters. The values of param- 
eters for termination of Ihltlal fusion process are threshold values of 
clustering. 


One of the threshold parameters is derived in d-varlate space. A differential 
volume AV(r) at radius r from a cluster center is defined by 

AV(r) « r^*"^ Ar (13) 


where Ar is a small segment of the radius r and symbol “ denotes the propor- 
tionality. This differential volume can be viewed as a hyper-shell (called 
simply shell hereafter) enclosed by two concentric hyperspherical surfaces 
(Fig. 2) . A series of concentric shells around a mode are drawn with in- 
creasing r by Ar. The number of cluster elements, AN(r2), in a shell is 
proportional to the volume of the shell multiplied by average population 
density In the shell: 


AN(r^) 


“ exp 



(14) 


The exponential term in Eq. 14 is that of an isotropic normal distribution with 
standard deviation o. This relation can be rearranged by employing squared 
radius r^ as 


d-2 2 

1' Ar 


exp 



(15) 
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Bivariate Population Distribution o£ 400 Landsat Data 
and the First Three Most Probable Candidates for Modes 
of Clusters (In circles). These typical example data 
were taken from the Landsat data :>ver a portion of 
Korean west coast (Park and Miller, 1978). The num- 
bers are occurrences of bivariate data In each block 
formed by both discrete MSS bands 6 and 7 data. Visu- 
ally, three clusters and their probable mode positions 
were distinguished without difficulty. 


Example Shell Volumes: 
AV,.^ec r^Ar « 

AVj ^ cc <r r Ar* 

AVj ^ oc r Ar oc Ar* 

AV,,^a:Ar a Ar*/r 


Fig. 2. A Differential Volume (llyperspherlcal 

Shell) In Three-Dimensional Space. Ex- 
ample formulas are given for one- through 
four-dimensional shells. Subscript i-d 
denotes 1-dlmenslon. 





where Che term In the left-hand aide la a generalized mean population density 
In a shell at distance r. The parameter la the variance in the population 
distribution. The right-hand side of this relationship is a monotonlcally de- 
creasing function with Increasing r^ (Fig. 3). 


C AN \ 2 

vs. r may reveal a family of straight lines 

' 1 

having the slope of - — - (Fig. 3b). A group of data, can be considered as 

2a^ 

originating from the same class if the estimates of shell population densities 
fall near a straight line. The slope of shell population data will remain 
fairly constant near the center of a cluster, but may change significantly 
when populations of other clusters enter into the shell. The squared radius 
at which the first significant change of the slope is detected is the threshold 
value (r^) for initiation of clustering. Such a change occurs when the se- 
quential searching point attempts Co cross a valley and Chen to climb a hill 
consisting of other cluster data (Fig. 4). 

There are several possibilities which may introduce slope changes in the cas« 
of anlstroplc distribution of the data. A plot of two-dimensional population 
distribution will be utilized to visualize some of these causes (Fig. 4). One 
of them is Che case where a cluster is well separated from Che others even 
though the isotropic assumption is employed and the threshold value covers 
nearly the whole region where most of cluster data are located (for example, 
cluster A in Fig. 4). Another case is when the group of one-cluster data are 
closely neighbored with the others (for example, cluster B in Fig. 4). Con- 
siderable overlaps between clusters may exist in this case. 

The threshold value r^ was given by the largest value of r^ satisfying that 

e(r2) < Min (0^, 0) (16) 


where 



( 17 ) 
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A Curve of Eq. 15 and Data of a Hypothetical Two- 
Cluster Mixture. Data of a unlmodal Isotropic 
distribution yields an exponential curve shown in 
(a). Discrete data of a two-cluster mixture may 
produce a plot shown in (b) , where two straight lines 
are approximate moving averages of two parts divided 
by The first straight line represents the popu- 

lation distribution of the first cluster with the 
parameter r^^ will be used as a threshold value 

for the cluster. 


Contours of a Mixture Probability 
Density Function (or Population Dis- 
tribution) in a Two-Dimensional Feature 
Space. The mixture p.d.f. consists of 
three unlmodal p.d.f. 's. A, B and C 
points are the modes of the clusters, 
r^.^ or r{-Q may be one of the threshold 
values estimated by the method shown 
in Fig. 3 (b) . They are Interpreted 
as the shortest Euclidean distance to 
a valley, which Is the natural boundary 
between the cluster and Its neighbor 


'c 


- e + fe Sq 


and 

0^ - updated critical slope up to the previous estimate 

0 - updated average slope of all previous estimates 

£q - positive empirical constant (about 2) 

Sq - updated standard deviation of Q. 

The first criterion (6<0(.) given in Eq. 16 prevents other cluster cells from 
merging into the cluster under formation. The second criterion ( 0<O) dis- 
tinguishes the cluster cells from the others which may cause violation of the 
normal distribution laws when they merge into the clusters. Values of the 
slope parameter must be less than zero for normally distributed data. Once 
the threshold value is found, a nc’,.’ initial cluster is formed by fusing cells 
closer than the distance corresponding to the threshold value. This initial 
cluster leads to computation of a set of parameters which will characterize 
the early stage of the cluster. 

Updating the Initial Cluster 

A group of data immediately surrounding (i.e. , within the threshold value r^; 
from) a mode candidate formed an Initial cluster under the assumption of the 
isotropic normal distribution. The parameters estimated for this initial 
cluster reflects to a certain degree the distribution characteristics even at 
its early stage, since the shape of its probability contour is retained through- 
out all the region of a cluster. Hence, the isotropic assumption is no further 
needed when the clustering function given by Eq. 6 is evaluated. 

To compute the clustering function, the a priori probability of the cluster 
should be known as well as the other parameters. It is one of the uncertain 
parameters especially at this very initial step. The a priori probability of 
a cluster in the mixture distribution is computed by 

p population in cluster i 
i total population 

This value changes whenever any date are merged into or deleted from a cluster. 
Other changing parameters are the position vector of the cluster centroid 
(mean) and covariance matrix. All of these parameters as well as random com- 
ponents of the data, contribute to fluctuation of estimates. 

It was shown that the expected value of the clustering function would be smaller 
than fi-n2 for any cell data within a well-defined cluster region. Values of 
Gi(x) computed at this stage, however, may not be close to those expected at 
the final stage. They may range from negative to large positive values mainly 
because initial estimates of cluster characteristic parameters deviate from 
reasonable values and/or because the data contain random or noisy components. 
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To allow for a cercaln level of flc.ctuatlons In estimates, especially at a 
formative stage, a flexible criterion value rather than £.n2 as in Eq. 9 is 
employed as 

Gc ' Gi f fr * ?G 

where 

Gj, ■ critical value of Gj^(j{) 

■ average value of Gj:(x) 
fg ■ empirical constant (about 2.) 

Sq ■ standard deviation of Gj^(x) 

A cell is tested for the membership of the cluster under formation by the 
criterion: 


Gj^(x) < Max(Gj,, .I?n2) (21) 

[t will be rejects’ If this inequality is not satisfied. The criterion value 
is continuously Ui><‘.ited as a new member is merged into the cluster. The empir- 
ical constant f^ as well as £q in Eq. 18 is not a sensitive parameter, but 
selection of its value depends upon the detail required in cluster divisions 
in Ihu end result. y 

»■ 

The criterion test is always applied first to the cell having the highest 
probability density among the remaining cells and then the next test is per- 
formed. In this way the present estimate of G^Cx) would be very close to those 
of the sample points that just Joined the group in the previous steps, if the 
point is a strong candidate of the cluster. Otherwise, it would be of far 
greater value than those in the region, especially at the earlier stage of 
cluster formation. In this case it is thought that the point is picked up 
from a hill side of another group (Fig. 1). The way to jump up and down from 
a hypothetical hill to the others creates distinct distance (G^ value in a 
precise term) gaps between points within the cluster being formed and that of 
other cluster candidates. These distance gaps allow gradual updating of 
cluster characteristic parameter values by first merging cells only closer ^ 

to the centroid. Gradual updating is important in this approach since esti- 
mated parameters at the earlier stage have larger uncertainty factors thrn 
those at later or final steps. Testing membership candidacy for each point, 
merging or rejecting, and updating of the parameters continue until the last 
point is checked. After all the above-mentioned steps are processed, 
searching for the next cluster is repeated.^ The test stops if no single 
cell element is left over (or if the maximum number of clusters set up in the 
program is reached). This procedure is similar in manner to "hill-sliding.'' 

One who is sliding down from the highest point on a hill will eventually 
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arrive at Cha botcoa. Gaoaatvical intarpretatlor of tha algorithm davaloped 
hara in multidimenaional apaca ia not directly comparable to the pathway of 
hlll-aliding by an ob.fact. But tha general procedure may be conaldered an a 
"hill-alldlng" aspect. Actual paths from the present position to tha next lower 
density point will be zigzag motion due to randomnasa of the estimated probe- 
bilii. ' density function in the discrete space. The pathway is always descend- 
ing oi’ leveling, ending at the bottom of a valley. 

Cluster Compactness 

Moat of the measures to examine goodness of clustered results give relative 
comparisons on the basis of original data structure or among clusters thtm- 
selves. The sum of squared errors within clusters and divergence between 
clusters are typical examples of such measures. The former evaluates the 
deviation of cluster samples from each cent'*oid, while the latter measures 
separability between two clusters. In either example, the quantities of the 
measures increase with increasing dimensions of the feature space (Tou ot al., 
1974). Thus, difficulties are encountered In standardizing criteria of these 
measures. It is desirable to formulate a cluster measure Independent of the 
number of variables and the number of sample data employed for clustering. 

A measure is proposed to evaluate the goodness of an individual cluster as 


where 



( 22 ) 


Lj ■ compactness of cluster 1 

Ni •• population of cluster 1 

d •• dimension of feature space 

T • total scatter matrix of the data 

and the determinants (|c^| and |t|) of both matrices, and T, exist. 

The cluster compactness parameter Is a dimensionless quantity. The denomi- 
nator Is constant for a given set of data and has the dimension of length-square. 
It can be considered a characteristic value of the data (say Cl). The deter- 
minant of a scatter matrix is proportional to the product of the variances in 
the direction of the principal axes, which are defined by the canonical trans- 
form of the scatter matrix. It is the volume of a hyper-ellipsoid defined by 
the unit Mahalanobis distance (l.e., Tq.-I (x-ij^)<= 1) from the cluster 

centroid. The volume measures the average scatterness (or squared Euclidean 
distance) of the pattern vectors within the cluster around their mean pattern 
vector. The length between the centroid and a point on the hyper-ellipsoid 
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may be Interpreted ae the mean aquared-error In the direction of the feature 
I space. For this reaaon, the hyper-ellipsoidal volume defined by the dater- 

I minant of a cluster covariance matrix will be called simply the "scatcerness 

volume" of the cluster. The value in the bracket of Eq. 22 is approximately 
I proportional to the average volume per cluster element If the number of ele- 

ments defining the covariance matrix Is sufficiently larger than that of 
, dimensions. Subtraction by d from N or In the denominator of each bracket 

(Is devised for the unbiased estimation of the parameter. Note that pattern 

vectors less than or equal to the number of dimensions cannot form any hyper- 
volume and the covariance matrix of those pattern vectors is always singular 
(.Duda et al., 1973). However, the value of d may be any nonnegative value, 

\ If desired for the purpose of defining the parameter only. 

[ Analysis of the Landsat data in this study indicates that clusters having a 

compactness parameter less than 0.4 are distinctly separable from others and 
that those with a paia.jeter larger than 1 are scattered around in a region 
* rather than distributed normally. 

Separability between Clusters 

Distinctness of a cluster against the rest of the data has been evaluated in 
, terms of various measures, such as Mahalanobls distance and divergence (Duran 

and Odell, 1974). Mahalanobls distance was introduced for a measure of metric 
* distance between two population centroids (Atchley et al., 1975). Its original 

r definition is different from the concept employed here, which is a distance 

i measure between a pattern vector and c cluster centroid. The original formula 

1 uses a pooled covariance matrix of two distributions. Application of this 

' formula to all possible pairs of classes requires conslderablr computational 

time if the number of classes is large. 

, Divergence is another commonly used measure of dissimilarity between two dis- 

tributions (Tou et al., 1974; Swain, 1972). It is defined by the S'lm of ex- 
pectations of log- likelihood ratios in favor of one clase against the ether: 


dx (23) 

X 

The divergence is inferred as the total average information for discrimination 
between two classes. Higher values of the divergence estimates indicate better 
separability between the pair. It also possesses other interesting properties 
(Tou et al., 1974; Swain, 1972). 

The divergence is used in this paper to analyze the clustering performance. 

The major reason for employing the parameter is that it can be computed by a 
simpler formula under Gaussian assumption. For two Gaussian classes with un- 
equal a priori probabilities, Eq. 23 is reduced to 


'ij “ / [Pi (x) - Pj (x)] 


Pj(x) 

Pj(x) 
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i 



Dij - Jj tr [(PiCi - PjCj) (Cj-‘ 


Ci’‘)] 


+ »S tr t(PlCi”‘ 
+ (P^ - Pj) in 



PjlcJ*s ’ 


Oil - Hj) (Ml - Mj)^] 


( 24 ) 


where tr denotes the trace of the matrix in the bracket. This Is an extended 
formula of the relationship usually seen In the literature (Tou et al., 1974) 

In the case of two distributions with different mixing proportions. It Is 
noteworthy that the Mahalanobls generalized distance is the divergence between 
two Gaussian populations with unequal mean vectors but equal a priori proba- 
bilities and covariance matrices (Tou et al., 1974). The divergence Is 
normalized by the sum of two class a priori probabilities as 2Dij/(Pi + Pj). 

The additive property of divergence for Independent variables Indicates that 
no universal value of a divergence criterion Is acceptable for any combinations 
of multivariate measurements. It Is desirable to reduce the effects of dimen- 
sionality as well as the sample size In cluster analysis. For this reason, 
the estimates of divergence divided by the entropy of the data Is used In this 
study, whenever any comparison is made regarding divergence. The entropy E(x) 
is a statistical measure of uncertainty defined by (Young et al., 1974) 


E (X') 


/ P(2i) 

X 


[l/p(x)] dx. 


(25) 


It is Interpreted as the expected value of an information unit, £n[l/p(x)] , 
that Is, the average uncertainty of the Information source. As Indicated by 
its functional form similar to that of divergence, Eq. 23, the entropy possesses 
properties similar to those for divergence. The normalized divergence to the 
entropy given by 


2Uij 

(Pi+Pj) E(x) 


(26) 


is comparable in any combination of variables. This value can determine rel- 
ative separability of one cluster against the other regardless of the number 
of variables employed. Higher values indicate distinctive separability between 
the pair of clusters while smaller ones mean high resemblance of the pairs in 
their data characteristics. 

On the Overall Ob.jectlve of Partitioning 

Remote sensing data of natural scenes may contain countless subcategorlcal 
information on natural land-cover /land-use classes. One of the best partition- 
ing In an established mathematical frame may not satisfy a user (or analyst) 
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wljo desires the class categorical Information at a certain level. Tuning of 
the mathematical goal at a user's desired level Is not easily achievable by a 
numerical scale. Existence of various levels for classification schemes 
(Anderson et al., 1976) Inevitably introduces heuristic parameters to obtain 
the desired level of the resultant classification or clustering. 

To evaluate the performance of clustering, the following overall objective 
function was set forth: 


Ic 

F (N^-d)L^ (27) 

1-1 


subject to 


f^l<?n f Wi) 5 f^j(5n c 

w^) for all 1, j and n. 

(28) 

1 5 Ic, 


(29) 

0 < Lc if Min 

< Dg for all i and j. 

(30) 

Me < Ml < N for all 1 

1 

(31) 


where Dg, L(. and are the number of resultant clusters, tlw nimum ac- 
ceptable value of the normalized divergence, the maximum acceptable value of 
the compactness parameter, and the minimum number of probability cells in a 
cluster, respectively. is the number of cells in cluster i. The minimum 

number Mj. of cells in a cluster should be larger than the number of variates 
(dimensions) d, so that a covariance matrix might not be singular. This is a 
better statement than that N^. < < N where Nc is the minimum number of 

identities required in a cluster. Tne reason is that Mi < Ni and hence it 
gives better assurance of a covariance being nonsingular. Note that a co- 
variance matrix is always singular if Ni < d or Mi < d (Duda et al., 1973). 
Parts of the constraints: 1) 2 1, 2) '' 0, and 3) Mi < N are self- 

evident and there is no requirement for specification of these criteria in 
the algorithm. However, an investigator may input any other desired values 
which do not exceed the limits as parameters. It is also worthwhile to note 
that cluster or class identities less than ten times the dimensionality d will 
usually lead to an increase in probability of error if predictions are made 
based on their covariance matrices (Ball, 1965). The constraint Eq. 28 is 
equivalent to the decision rule of the maximum likelihood classification 
(Eq. 11), since the only other variable in clustering function Gj^(3^) defined 
by Eq. 6 is p(j^)* which is common in both sides. Hence, the inequality, 

Eq. 28, can be called the "maximum likelihood constraint" for each data point. 


The objective function F can be expressed in terms of covariance matrices: 


Ic 

N-d 

^ " 1 t 1 S 1*^11 
' 1=1 


(27a) 
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Minimizing F Is equivalent to minimizing the sum of the determinants of indi- 
vidual cluster covariance matrices. Therefore, the objective of clustering is 
to obtain a partitioning of the data which minimizes the siuu of cluster scatter- 
ness volumes under the imposed constraints. The objective function is generally 
nonlinear and Its usual multidimensional form cannot be described in easily 
manageable terms. 

There arc substantial differences between the present formulation and those 
which use frequently-cited clustering criterion function |w| , where 


- 



(32) 


The simple algebraic sum of all the cluster covariance matrices, W, Is commonly 
referred to the total Intragroups (or pooled-wlthln clusters) scatter matrix 
(Friedman et al., 1967; Fukunaga et al., 1970; Duda ot al . , 1973). It has 
been shown that the determinant of the matrix Is invariant to nonsingular 
linear transformations of the data aiW, is able to produce well-definable 
natural cluster boundaries when it Is used as a clustering criterion (Fukunaga 
et al., 1970). The determinant of the scatter matrix alone is of no use as a 
clustering criterion function if the number of clusters is not knoxjn in ad- 
vance, since more subdivisions of the data space, tend to reduce the value of 
the iletennlnant . An essential difference between the present objective func- 
tion F and the determinant of the total Intragroups scatter matrix, |Wl, us a 
clustering criterion comes from the fact: 

>c Ic 

lwl " I 5^ 0l 1 ^ ll’il . 

1"*1 1-1 


The ilolcrmlnanl iwl has been used as a measure of compactness of the clusters 
(Duda et al., 197J), but this Interpretation is somewhat misleading. The two 
simple examples (Fig. 5) Illustrate the inappropr lateness of using |w| as a 
clustering objective function or an overall cluster compactness measure. 


C 


1 



W - 



W ’ * I' ' 1 




|wj - 16, 

|W I - 25, 






C ' j C ' = 8 . 


The first case Is that two-dimensional covariance matrices of two clusters are 
Identical except for their locutions, and the second that the (wo have the 
same scatterne:^s volumes (determinants) but dilferont orientations and locjitlons 
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tU'nenti of fovananct Motoemf Ihtif 
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Kig. 5. Iliusfrut ion of tho ToLa] Intragroups 
Scatter Matrices for Two Separable 
Cluster in Two-Dimensional Space. Each 
cluster lias the same scattemess volume 
but (.lifferent mean (centroiin from tho 
other in either case. The pooled co- 
variance matrices W(=(’| + C,) anil W'(=C^' 

+ C' 2 ') are different from each other 
since cluster orientations are not the 
same in both cases, even though ICJ = 

Ic.l = Ic^'l = Ic.’l. 

Under the assumption that two clusters are completely separated in both ca.ses, 
their total Intragroups scatter matrices have different shapes and determinant 
values. The first case yields smaller determinant values of tlie resultant 
scatter matrix than the latter does. This indicates tliat minimizing the deter- 
minant of the total Intragroup scatter matrix W forces all tho clusters to be 
partitioned in the shapes and orientations as similarly ns possible. Any set 
of separable clusters would not make much difference whatever their natural 
shapes or orientations. This is another drawback in using jw| criterion for 
clustering. The present clustering formulation has been devised to circumvent 
these difficulties by employing the sum of the determinants of individual 
cluster covariance matrices as the objective function. The set of constraints 
has provided some guidelines to overcome various undesirables aspects commonly 
encountered in clustering the heterogeneous natural scene data in this formula- 
tion. 

The global solution to this optimization problem may be found by a systematic 
but exhaustive enumeration of all partitioning alternatives. Search of the 
solution by such an enumeration is often not permitted due to requirement of 
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excessive computation and memory storage for a large volume of data. It Is 
noted that the objective function Is proportional to (N-d) . Hence, a data 
partition Index to evaluate the clustered results la proposed as a sample- 
size- Independent Indicator: 


PI - [F/(N-d)J ‘/d 




/d 


(33) 


This Index is an overall measure of the goodness for the data partitioning. 
The smaller the Index value Is, the better optimal partitioning is achieved. 


Results and Discussion 


The proposed cluster initialization method was programmed and tested using the 
Landsat data of May 11, 1976, over Chippewa River Basin, Wisconsin. Sample 
data of the Multi-Spectral Scanner (MSS) bands 4 and 5 were extracted from the 
satellite's computer-compatible tape by the Landsat Mapping System of Colorado 
State University, Fort Collins, Colorado (Park et al., 1979). The data covered 
two different locations of the study area. Of the total 200 pixels (picture 
elements), 111 population cells were Identified (Fig. 6). The test was made 
with the input parameters of £g ■ 2.7 and fg “ 1 for the criterion functions, 
Eqs. 18 and 20, respectively. Fifteen initial clusters were formed (Fig. 7), 
but one of them which did not meet the criterion, Eq. 31, was excluded from 
various cluster evaluations (Table 1). 


The results suggested that there would be 
a compact cluster if < 0.2, 
a scattered cluster if > 0.4, and 

a well-separated cluster from the others If G^j > 50 for all j i. 

Several test runs with varying input values of both parameters yielded similar 
partitioned patterns of the data and most of significant clusters (like 
clusters 1 through 5) were Identified. 

Summary 

A method was developed to extract significant Gaussian clusters on the basis 
of discrete multivariate probability density estimates. The key idea was that 
a proposed clustering function could distinguish elements of a cluster under 
formation from the rest without knowledge of the other clusters. The algorithm 
described liere showed effectiveness In extracting Gausslan-type clusters from 
Landsat MSS data. The partitioning obtained by this technique was not optimal, 
but it could be used as reasonable input data to other iterative clustering 
programs for further improvement. 
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A dimensionless cluster compactness parameter was set forth as a universal 
measure of cluster goodness and used favorably in test runs. Separability 
between clusters was examined by employing a normalized divergence which was 
defined by the divergence divided by the entropy of the entire sample data. 
The overall clustering objective function and the partition index were form- 
ulated in terms of cluster covariance matrices. They were good indicators 
of the overall achievement for evaluation of the partitioned results obtained 
under various conditions or at different stages. 
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Fig. 6., Bivariate Population Distribution of Landsat 
Sample Data. Cluster seeding points were 
identified by the hlll-slldlng algorithm. 
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Fig. 7. 


Initial Clustars of the Test Data. Th® 
numbers are cluster labels. 
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